08 July, 2019

Proudly supported by the
people of Western Australia
through Channel 7's Telethon

Statistical Modelling in R

Notes

order of included R functions:

  • lm()
  • summary()
  • ggplot() + stat_qq()
  • ggplot() + geom_smooth()
  • tidy(), augment(), glance()
  • summ(), plot_summs()
  • autoplot()
  • glm() (different families)

What we cover

  • Linear Regression
  • Multiple Linear Regression
  • Logistic Regression

Linear Regression

Linear Regression in R

In a linear regression, we aim to find a model:

  • that represents our data and
  • can give information about the association between our variables of interest.

The command in R for a linear model is

lm(y~x).

y is the outcome variable we are interested in, and x is the variable whose association with y we want to test.
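As a quick, self-contained sketch (using the built-in mtcars data rather than the workshop data set):

```r
# Outcome y = fuel efficiency (mpg), predictor x = car weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)
fit  # prints the fitted intercept and the slope for wt
```

The printed slope is the estimated change in y per one-unit increase in x.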

Example:

Data set summary

Let’s first have a look at the summary table of the example data set, by using the summary() command:

##       day1             day2               day3      
##  Min.   :-1.611   Min.   :  0.2396   Min.   : 0.00  
##  1st Qu.: 3.031   1st Qu.:  9.9763   1st Qu.:13.25  
##  Median : 5.619   Median : 15.9973   Median :19.37  
##  Mean   : 5.428   Mean   : 16.7252   Mean   :20.27  
##  3rd Qu.: 7.772   3rd Qu.: 20.6984   3rd Qu.:28.05  
##  Max.   :12.142   Max.   :143.8187   Max.   :39.03  
##                   NA's   :4          NA's   :16

Visualisation of data distributions

Helpful plots before modelling

Before we start with the linear regression model, we need to get an idea of the underlying data and its distribution. We know that linear regression has the following assumptions:

  • linearity: the relationship between x and y is linear
  • independence of the observations
  • normality: the residuals are approximately normally distributed
  • homoscedasticity: the residuals have constant variance

QQ-plot:

tki_demo %>%
  filter(day2 < 100) %>%
  gather(Days, measurement, day1:day3, factor_key=TRUE) %>%
  ggplot( aes(sample=measurement, color=Days)) + stat_qq() + facet_wrap(~Days)

Boxplots to check for outliers

Plot the variables

Linear Regression

The lm() function

In the plots, we could already see that petal length and petal width seem to be associated. This becomes obvious when drawing a line through the points. Let’s now fit a linear regression model in R.

lm(Petal.Length~Petal.Width, data=iris)

  • As said before, the first argument in the code is y, our outcome variable or dependent variable. In this case it is Petal.Length.

  • The second argument is x, the independent variable. In our case: Petal.Width.

  • We also specify the data set that holds the variables we specified as x and y.

Linear Regression Results

Now we want to look at the results of the linear regression. So how do we get the p-value and \(\beta\)-coefficient for the association?

In R, we add the summary() function to the lm() function, like so:

summary(lm(y~x, data=data))

Example Results

## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33542 -0.30347 -0.02955  0.25776  1.39453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08356    0.07297   14.85   <2e-16 ***
## Petal.Width  2.22994    0.05140   43.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16
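The coefficients in the table above can be used directly to compute fitted values by hand; a small sketch reproducing this from the same iris model:

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
b <- coef(fit)           # (Intercept) ~ 1.08, Petal.Width ~ 2.23
# Expected petal length for a petal width of 1 cm:
unname(b[1] + b[2] * 1)  # ~ 1.08 + 2.23 = 3.31
```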

jtools and broom

Improving the accessibility of the lm() results

The output above contains a lot of relevant information, but it is not straightforward to access the individual parameters such as p-values and betas. The broom package follows the “tidy” data-handling philosophy in R and turns the linear model results into an easily accessible tibble:

lm1 <- lm(Petal.Length ~ Petal.Width, data = iris)
tidy(lm1)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     1.08    0.0730      14.8 4.04e-31
## 2 Petal.Width     2.23    0.0514      43.4 4.68e-86
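The other two broom verbs listed at the start work the same way: glance() returns a one-row tibble of model-level summaries (R-squared, F statistic, …) and augment() attaches fitted values and residuals to the original data. A sketch with the same model:

```r
library(broom)
lm1 <- lm(Petal.Length ~ Petal.Width, data = iris)
glance(lm1)          # one row: r.squared, adj.r.squared, statistic, p.value, ...
head(augment(lm1))   # data columns plus .fitted, .resid, .hat, .cooksd, ...
```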

Improving output style

The broom package helps with the accessibility of the output, but the style of the output is not very appealing for a publication or a report. The jtools package helps with this and has other nice functionalities such as forest plots of coefficients and confidence intervals:

export_summs(lm1)
Model 1
(Intercept) 1.08 ***
(0.07)   
Petal.Width 2.23 ***
(0.05)   
N 150       
R2 0.93    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Diagnostics

After running the linear regression, we need to assess the quality of the model. First, we look at the R2 statistic, a measure of how much of the variance in the data is explained by our model.



The R2 statistic ranges from 0 to 1, where 1 means the model explains 100% of the variance. This means the model fits the data perfectly.
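The reported R2 can be extracted from the fitted model, or recomputed from the residuals; for the iris model above:

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
r2 <- summary(fit)$r.squared   # 0.9271, as in the output above
# Equivalent definition: 1 - residual sum of squares / total sum of squares
rss <- sum(resid(fit)^2)
tss <- sum((iris$Petal.Length - mean(iris$Petal.Length))^2)
1 - rss / tss
```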

Diagnostic Plots

library(ggfortify)
model <- lm(Petal.Length ~ Petal.Width, data = iris)
autoplot(model)

Multiple Linear Regression

How to?

Multiple linear regression uses the same function in R:

lm(y ~ x + covar1 + covar2 + … + covarx, data = data)

The standard R output is again quite messy for reports, but helps with a first visual inspection at the command line:

summary(lm(day3 ~ day2 + male + intervention, data=tki_demo))
## 
## Call:
## lm(formula = day3 ~ day2 + male + intervention, data = tki_demo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.227  -1.813  -0.192   1.687   9.279 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         19.28189    1.18213  16.311  < 2e-16 ***
## day2                -0.01281    0.03689  -0.347    0.729    
## maleTRUE             1.48829    1.20570   1.234    0.221    
## interventionDrug 2   9.26868    1.33198   6.959 1.11e-09 ***
## interventionPlacebo -9.26457    1.40139  -6.611 4.93e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.771 on 75 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.726,  Adjusted R-squared:  0.7113 
## F-statistic: 49.67 on 4 and 75 DF,  p-value: < 2.2e-16

Example 1: one model

With the demo data set:

export_summs(lm(day3 ~ day2 + male + smoker, data = tki_demo))
Model 1
(Intercept) 15.81 ***
(1.46)   
day2 0.16 ** 
(0.06)   
maleTRUE 7.11 ***
(1.89)   
smokerTRUE -3.67    
(2.03)   
N 80       
R2 0.24    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Example 2: more models

With the demo data set:

lm1 <- lm(day3 ~ day2, data = tki_demo)
lm2 <- lm(day3 ~ day2 + male + smoker, data = tki_demo)
export_summs(lm1, lm2)

Model 1 Model 2
(Intercept) 17.25 *** 15.81 ***
(1.41)    (1.46)   
day2 0.16 **  0.16 ** 
(0.06)    (0.06)   
maleTRUE         7.11 ***
        (1.89)   
smokerTRUE         -3.67    
        (2.03)   
N 80        80       
R2 0.09     0.24    
*** p < 0.001; ** p < 0.01; * p < 0.05.
Forest plot to compare coefficients in the model

Often we want to visualise the coefficients in a model to see their impact on the outcome, or to compare the coefficient of a specific variable across two models that differ only in the adjusted covariates. The jtools package has a function that does this very easily, utilising ggplot2:

plot_summs()

Example 1: one model

plot_summs(lm1)

Example 2: more models

plot_summs(lm1, lm2)

Example 3: compare specific coefficients

plot_summs(lm1, lm2, coefs="day2")

Getting more info: Confidence Intervals

With jtools we can access more information from the model in one easy step. Here, we access the confidence intervals and the variance inflation factors (for multicollinearity testing), but leave out the p-values:

summ(lm2, scale = TRUE, vifs = TRUE, part.corr = TRUE, confint = TRUE, pvals = FALSE)$coeftable
##                  Est.       2.5%      97.5%    t val.      VIF  partial.r
## (Intercept) 18.569528 16.2225902 20.9164657 15.758586       NA         NA
## day2         2.568780  0.7826129  4.3549470  2.864328 1.021056  0.3121443
## maleTRUE     7.107081  3.3480942 10.8660688  3.765636 1.041817  0.3965366
## smokerTRUE  -3.665084 -7.7050377  0.3748704 -1.806864 1.054610 -0.2029483
##                 part.r
## (Intercept)         NA
## day2         0.2862919
## maleTRUE     0.3763784
## smokerTRUE  -0.1805975

Logistic regression
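As a preview of the glm() syntax from the function list above (a sketch on simulated data; the workshop examples may use the tki_demo variables instead), a logistic regression is a glm() with the binomial family:

```r
set.seed(1)
# Simulated binary outcome with one continuous predictor
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
summary(fit)$coefficients  # estimates on the log-odds scale
exp(coef(fit))             # odds ratios
```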